CRISP-DM stands for the Cross-Industry Standard Process for Data Mining. It provides a structured approach to planning a data mining project and is a robust, well-proven methodology.
For Lab One we will use the first two phases of this methodology. We will start by focusing on understanding the objectives and requirements from a business perspective, then use this knowledge to define the data problem and project plan. Next we will identify data quality issues, discover initial insights, and detect interesting nuggets of information that might form a hypothesis for analysis.
In future labs we will execute all of the CRISP-DM steps.
The first stage of the CRISP-DM process is to understand what you want to accomplish from a business perspective. We will define our objectives and constraints that must be properly balanced. The goal of this stage of the process is to uncover important factors that could influence the outcome of our project.
This analysis of Home Credit's Default Risk dataset will focus on generating accurate loan default risk probabilities. Predicting loan defaults is essential to the profitability of banks and, given the competitive nature of the loan market, a bank that collects the right data can offer and service more loans. The target variable of the dataset is the binary label, 'TARGET', indicating whether the loan entered into default status or not.
List the resources available to the project, including:

- Assumptions
- Constraints
- Costs
- Benefits
- Business success criteria
- Data mining success criteria

Produce the project plan:
Use Home Credit current customer data to predict whether a potential client is capable of repaying the requested loan. During this process we will determine the features that are most influential in determining this target variable.
The second stage of the CRISP-DM process requires you to acquire the data listed in the project resources. This initial collection includes data loading, if this is necessary for data understanding. For example, if you use a specific tool for data understanding, it makes perfect sense to load your data into this tool. If you acquire multiple data sources then you need to consider how and when you're going to integrate these.
Initial data collection report - List the data sources acquired together with their locations, the methods used to acquire them and any problems encountered. Record problems you encountered and any resolutions achieved. This will help both with future replication of this project and with the execution of similar future projects.
# Import Libraries Required.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
# import custom code
from cleaning import read_clean_data, missing_values_table, load_bureau, create_newFeatures, merge_newFeatures
# some defaults
pd_max_rows_default = 60
# load data
# path = './application_train.csv'
# note that XNA is a encoding for NA interpret as np.nan
df = pd.read_csv('./application_train.csv',
na_values = ['XNA'])
#loading bureau dataset
bureau = pd.read_csv('./bureau.csv',
na_values = ['XNA'])
load_bureau()
#engineering features from bureau dataset
newFeatures = create_newFeatures(bureau)
newFeatures.head()
Data description report - Describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Evaluate whether the data acquired satisfies your requirements.
We will use two of the files from the total dataset:

- application_train.csv: information provided with each loan application
- bureau.csv: information regarding clients from the credit bureaus

There are 122 features and 307511 observations in application_train.csv.
df.shape
The following features are loan application attributes from application_train.csv.
Descriptions, types, and units are given for each feature.
| Feature | Description | Type | Units |
|---|---|---|---|
| SK_ID_CURR | ID of loan in our sample | Category | N/A |
| TARGET | Target Variable (1 - difficulty paying loan, 0 - all other cases) | Category | N/A |
| NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | Category | N/A |
| CODE_GENDER | Gender of the client (M - male, F - female) | Category | N/A |
| FLAG_OWN_CAR | Flag if the client owns a car | Category | N/A |
| FLAG_OWN_REALTY | Flag if client owns a house or flat | Category | N/A |
| CNT_CHILDREN | Number of children the client has | Continuous | N/A |
| AMT_INCOME_TOTAL | Income of the client | Continuous | Currency |
| AMT_CREDIT | Credit amount of the loan | Continuous | Currency |
| AMT_ANNUITY | Loan annuity | Continuous | Currency |
| AMT_GOODS_PRICE | For consumer loans it is the price of the goods for which the loan is given | Continuous | Currency |
| NAME_TYPE_SUITE | Who was accompanying client when he was applying for the loan | Category | N/A |
| NAME_INCOME_TYPE | Clients income type (businessman, working, maternity leave) | Category | N/A |
| NAME_EDUCATION_TYPE | Level of highest education the client achieved | Category | N/A |
| NAME_FAMILY_STATUS | Family status of the client | Category | N/A |
| NAME_HOUSING_TYPE | What is the housing situation of the client (renting, living with parents, ...) | Category | N/A |
| REGION_POPULATION_RELATIVE | Normalized population of region where client lives (higher number means the client lives in more populated region) | Continuous | N/A |
| DAYS_BIRTH | Client's age in days at the time of application | Continuous | Days |
| DAYS_EMPLOYED | How many days before the application the person started current employment | Continuous | Days |
| DAYS_REGISTRATION | How many days before the application did client change his registration | Continuous | Days |
| DAYS_ID_PUBLISH | How many days before the application did client change the identity document with which he applied for the loan | Continuous | Days |
| OWN_CAR_AGE | Age of client's car | Continuous | Years |
| FLAG_MOBIL | Did client provide mobile phone (Y, N) | Category | N/A |
| FLAG_EMP_PHONE | Did client provide work phone (Y, N) | Category | N/A |
| FLAG_WORK_PHONE | Did client provide home phone (Y, N) | Category | N/A |
| FLAG_CONT_MOBILE | Was mobile phone reachable (Y, N) | Category | N/A |
| FLAG_PHONE | Did client provide home phone (Y, N) | Category | N/A |
| FLAG_EMAIL | Did client provide email (Y, N) | Category | N/A |
| CNT_FAM_MEMBERS | How many family members does the client have | Continuous | N/A |
| OCCUPATION_TYPE | What kind of occupation does the client have | Category | N/A |
| REGION_RATING_CLIENT | Our rating of the region where client lives (1,2,3) | Category | N/A |
| REGION_RATING_CLIENT_W_CITY | Our rating of the region where client lives with taking city into account (1,2,3) | Category | N/A |
| WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply for the loan | Category | N/A |
| HOUR_APPR_PROCESS_START | Approximately at what hour did the client apply for the loan | Category | N/A |
| REG_REGION_NOT_LIVE_REGION | Flag if client's permanent address does not match contact address (1=different, 0=same, at region level) | Category | N/A |
| REG_REGION_NOT_WORK_REGION | Flag if client's permanent address does not match work address (1=different, 0=same, at region level) | Category | N/A |
| LIVE_REGION_NOT_WORK_REGION | Flag if client's contact address does not match work address (1=different, 0=same, at region level) | Category | N/A |
| REG_CITY_NOT_LIVE_CITY | Flag if client's permanent address does not match contact address (1=different, 0=same, at city level) | Category | N/A |
| REG_CITY_NOT_WORK_CITY | Flag if client's permanent address does not match work address (1=different, 0=same, at city level) | Category | N/A |
| LIVE_CITY_NOT_WORK_CITY | Flag if client's contact address does not match work address (1=different, 0=same, at city level) | Category | N/A |
| ORGANIZATION_TYPE | Type of organization where client works | Category | N/A |
| EXT_SOURCE_1 | Normalized score from external data source | Continuous | N/A |
| EXT_SOURCE_2 | Normalized score from external data source | Continuous | N/A |
| EXT_SOURCE_3 | Normalized score from external data source | Continuous | N/A |
| OBS_30_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings with observable 30 DPD (days past due) default | Continuous | N/A |
| DEF_30_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings defaulted on 30 DPD (days past due) | Continuous | N/A |
| OBS_60_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings with observable 60 DPD (days past due) default | Continuous | N/A |
| DEF_60_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings defaulted on 60 DPD (days past due) | Continuous | N/A |
| DAYS_LAST_PHONE_CHANGE | How many days before application did client change phone | Continuous | Days |
| FLAG_DOCUMENT_2 | Did client provide document 2 | Category | N/A |
| FLAG_DOCUMENT_3 | Did client provide document 3 | Category | N/A |
| FLAG_DOCUMENT_4 | Did client provide document 4 | Category | N/A |
| FLAG_DOCUMENT_5 | Did client provide document 5 | Category | N/A |
| FLAG_DOCUMENT_6 | Did client provide document 6 | Category | N/A |
| FLAG_DOCUMENT_7 | Did client provide document 7 | Category | N/A |
| FLAG_DOCUMENT_8 | Did client provide document 8 | Category | N/A |
| FLAG_DOCUMENT_9 | Did client provide document 9 | Category | N/A |
| FLAG_DOCUMENT_10 | Did client provide document 10 | Category | N/A |
| FLAG_DOCUMENT_11 | Did client provide document 11 | Category | N/A |
| FLAG_DOCUMENT_12 | Did client provide document 12 | Category | N/A |
| FLAG_DOCUMENT_13 | Did client provide document 13 | Category | N/A |
| FLAG_DOCUMENT_14 | Did client provide document 14 | Category | N/A |
| FLAG_DOCUMENT_15 | Did client provide document 15 | Category | N/A |
| FLAG_DOCUMENT_16 | Did client provide document 16 | Category | N/A |
| FLAG_DOCUMENT_17 | Did client provide document 17 | Category | N/A |
| FLAG_DOCUMENT_18 | Did client provide document 18 | Category | N/A |
| FLAG_DOCUMENT_19 | Did client provide document 19 | Category | N/A |
| FLAG_DOCUMENT_20 | Did client provide document 20 | Category | N/A |
| FLAG_DOCUMENT_21 | Did client provide document 21 | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_HOUR | Number of enquiries to Credit Bureau about the client one hour before application | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_DAY | Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application) | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_WEEK | Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application) | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_MON | Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application) | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_QRT | Number of enquiries to Credit Bureau about the client 3 months before application (excluding one month before application) | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_YEAR | Number of enquiries to Credit Bureau about the client one year before application (excluding the last 3 months before application) | Category | N/A |
The following features were engineered from the loan application features (from application_train.csv).
| Engineered Feature | Description | Type | Units | Formula |
|---|---|---|---|---|
| CREDIT_INCOME_RATIO | Ratio of the credit amount to the client's income | Numeric | N/A | AMT_CREDIT / AMT_INCOME_TOTAL |
| ANNUITY_INCOME_RATIO | Ratio of the loan annuity to the client's income | Numeric | N/A | AMT_ANNUITY / AMT_INCOME_TOTAL |
| PERCENT_EMPLOYED_TO_AGE | Fraction of the client's life spent employed | Numeric | N/A | DAYS_EMPLOYED / DAYS_BIRTH |
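These ratios can be reproduced directly from the raw columns. A minimal sketch on a toy frame (the column names follow application_train.csv; the toy values are illustrative only):

```python
import pandas as pd

# Toy rows standing in for application_train.csv records.
df = pd.DataFrame({
    'AMT_CREDIT': [400000.0, 250000.0],
    'AMT_ANNUITY': [20000.0, 15000.0],
    'AMT_INCOME_TOTAL': [200000.0, 100000.0],
    'DAYS_EMPLOYED': [-2000.0, -4000.0],
    'DAYS_BIRTH': [-12000.0, -16000.0],
})

# Engineered ratios from the table above.
df['CREDIT_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
df['ANNUITY_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
df['PERCENT_EMPLOYED_TO_AGE'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']

print(df[['CREDIT_INCOME_RATIO', 'ANNUITY_INCOME_RATIO', 'PERCENT_EMPLOYED_TO_AGE']])
```

Note that DAYS_EMPLOYED and DAYS_BIRTH are both negative in the raw data (days before application), so their ratio is positive.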
The following features were engineered from the bureau features (from bureau.csv).
| Engineered Feature | Description | Type | Units | Formula |
|---|---|---|---|---|
| LOAN_COUNT | The total number of accounts, active and closed. | Numeric | N/A | length of CREDIT_ACTIVE, grouped by loan ID |
| CREDIT_ACTIVE | A count of active credit accounts by loan ID | Numeric | N/A | count of CREDIT_ACTIVE == 'Active', grouped by loan ID |
| CREDIT_DAY_OVERDUE | A count of days overdue for active credit accounts by loan ID | Numeric | N/A | sum of CREDIT_DAY_OVERDUE for CREDIT_ACTIVE == Active, grouped by loan ID |
| AMT_CREDIT_SUM | Total credit available from active accounts | Numeric | N/A | sum of AMT_CREDIT_SUM for CREDIT_ACTIVE == Active, grouped by loan ID |
| AMT_CREDIT_SUM_DEBT | Total debt of active accounts | Numeric | N/A | sum of AMT_CREDIT_SUM_DEBT for CREDIT_ACTIVE == Active, grouped by loan ID |
| AMT_CREDIT_SUM_LIMIT | Overall credit limit of active accounts | Numeric | N/A | sum of AMT_CREDIT_SUM_LIMIT for CREDIT_ACTIVE == Active, grouped by loan ID |
| AMT_CREDIT_SUM_OVERDUE | Total amount overdue | Numeric | N/A | sum of AMT_CREDIT_SUM_OVERDUE for CREDIT_ACTIVE == Active, grouped by loan ID |
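The actual implementation lives in create_newFeatures in cleaning.py; a hedged sketch of the per-loan aggregations the table describes (column names from bureau.csv, toy values illustrative) might look like:

```python
import pandas as pd

# Toy bureau records: two loans (SK_ID_CURR) with mixed active/closed accounts.
bureau = pd.DataFrame({
    'SK_ID_CURR': [100, 100, 100, 200],
    'CREDIT_ACTIVE': ['Active', 'Closed', 'Active', 'Closed'],
    'CREDIT_DAY_OVERDUE': [5, 0, 0, 0],
    'AMT_CREDIT_SUM': [10000.0, 5000.0, 20000.0, 8000.0],
    'AMT_CREDIT_SUM_DEBT': [4000.0, 0.0, 1000.0, 0.0],
    'AMT_CREDIT_SUM_OVERDUE': [100.0, 0.0, 0.0, 0.0],
})

# LOAN_COUNT: all accounts, active and closed, per loan ID.
loan_count = bureau.groupby('SK_ID_CURR').size().rename('LOAN_COUNT')

# The remaining aggregates use active accounts only.
active = bureau[bureau['CREDIT_ACTIVE'] == 'Active']
agg = active.groupby('SK_ID_CURR').agg(
    CREDIT_ACTIVE=('CREDIT_ACTIVE', 'size'),
    CREDIT_DAY_OVERDUE=('CREDIT_DAY_OVERDUE', 'sum'),
    AMT_CREDIT_SUM=('AMT_CREDIT_SUM', 'sum'),
    AMT_CREDIT_SUM_DEBT=('AMT_CREDIT_SUM_DEBT', 'sum'),
    AMT_CREDIT_SUM_OVERDUE=('AMT_CREDIT_SUM_OVERDUE', 'sum'),
)

new_features = pd.concat([loan_count, agg], axis=1).reset_index()
print(new_features)
```

Loans with no active accounts (SK_ID_CURR 200 here) come out with NaN in the active-only aggregates, which is why the notebook later fills the engineered features with zero before merging.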
Note: all the resolutions to the data quality issues described in this section are implemented in cleaning.py,
which is used to load the cleaned dataset.
In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and must be either filled in or removed before we train a machine learning model. First, let's get a sense of how many missing values are in each column.
While we always want to be careful about removing information, if a column has a high percentage of missing values, then it probably will not be useful to our model. The threshold for removing columns should depend on the problem.
# create missing values table with all rows
pd.set_option('display.max_rows', 122)
missing_values_table(df)
# return row display setting to default
pd.set_option('display.max_rows', pd_max_rows_default)
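missing_values_table is a helper imported from cleaning.py; a minimal sketch of the summary it plausibly produces (count and percentage of missing values per column, worst first) might look like:

```python
import numpy as np
import pandas as pd

def missing_values_sketch(df):
    """Count and percentage of missing values per column, sorted worst first."""
    miss = df.isnull().sum()
    table = pd.DataFrame({
        'Missing Values': miss,
        '% of Total Values': 100 * miss / len(df),
    })
    # Keep only columns that actually have missing values.
    table = table[table['Missing Values'] > 0]
    return table.sort_values('% of Total Values', ascending=False)

# Toy example: column 'a' is 50% missing, 'c' is 25% missing, 'b' is complete.
toy = pd.DataFrame({'a': [1, np.nan, np.nan, 4],
                    'b': [1, 2, 3, 4],
                    'c': [np.nan, 2, 3, 4]})
print(missing_values_sketch(toy))
```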
Each feature or set of features will be discussed in descending order of the rate of missing values.
Building Features with Missing Values
A large number of the features with missing values are normalized information about the building where the client lives (such as ENTRANCES_MODE or COMMONAREA_AVG).
It is plausible that only some of these features exist in a given client's building.
It is also plausible that some of the records are missing (not provided by the client).
There does not appear to be an indication if values are missing or not applicable to the client.
All of these values have a missing rate above 40%.
Missing values in OWN_CAR_AGE
This feature refers to the age of cars owned by the client.
Approximately 66.0% of the values are empty.
However, there is also a feature FLAG_OWN_CAR, which indicates that the client owns a car.
It is reasonable to expect that values will be missing if the client does not own a car.
For clients reporting owning a car, all but 5 car age values are present.
For clients reporting not owning a car, no car ages are reported.
Since the missing rate among car owners is actually very small, these missing values could be imputed.
# get indexes of OWN_CAR_AGE that are not NA
car_age_indexes = df[df.OWN_CAR_AGE.notnull()].index
# get indexes of FLAG_OWN_CAR with value equal to Y (client owns car)
owns_car_indexes = df.query('FLAG_OWN_CAR == "Y"').index
print('There are {} records with ages for cars'.format(car_age_indexes.shape[0]))
print('There are {} records indicating that the client owns a car'.format(owns_car_indexes.shape[0]))
# get the number of non-null car age values on records where client does not list a car
car_age_not_own_car = df.query('FLAG_OWN_CAR == "N"').OWN_CAR_AGE.notnull().sum()
print('There are {} car ages reported for clients that report NOT owning a car'.format(car_age_not_own_car))
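One hedged way to handle the handful of missing ages among car owners is to impute only within that group, for example with the owners' median (a sketch on toy data; not necessarily what cleaning.py does):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'FLAG_OWN_CAR': ['Y', 'Y', 'Y', 'N', 'N'],
    'OWN_CAR_AGE': [5.0, np.nan, 11.0, np.nan, np.nan],
})

# Impute only for clients who report owning a car; leave non-owners as NaN,
# since a missing age is expected when there is no car.
owners = df['FLAG_OWN_CAR'] == 'Y'
owner_median = df.loc[owners, 'OWN_CAR_AGE'].median()
df.loc[owners, 'OWN_CAR_AGE'] = df.loc[owners, 'OWN_CAR_AGE'].fillna(owner_median)
print(df)
```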
Missing values in EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3
Since there are zeros in these columns,
we expect that the missing values represent a lack of the external source for these clients.
Therefore, we will add an encoding feature EXT_SOURCE_<number>_AV
for each of the external source features that represents the presence
of external information.
These original columns will only be used as an interaction with these
encoding features and the missing values will be filled with zero.
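The encoding described above can be sketched as follows (the EXT_SOURCE_&lt;number&gt;_AV names follow the convention stated in the text; the values are toy data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'EXT_SOURCE_1': [0.5, np.nan, 0.0],
    'EXT_SOURCE_2': [np.nan, 0.7, 0.2],
    'EXT_SOURCE_3': [0.1, 0.3, np.nan],
})

for n in (1, 2, 3):
    col = 'EXT_SOURCE_{}'.format(n)
    # 1 if the external score is present, 0 otherwise.
    df[col + '_AV'] = df[col].notnull().astype(int)
    # Fill the missing scores with zero; the _AV flag preserves the
    # distinction between a true zero score and a missing score.
    df[col] = df[col].fillna(0)

print(df)
```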
Missing values in OCCUPATION_TYPE
There does not appear to be an indication that OCCUPATION_TYPE is systematically missing.
We will assume that the client did not provide this information and impute with a new categorical level Unknown.
OCCUPATION_TYPE has a missing rate of 31.3%.
Missing values in ORGANIZATION_TYPE
Later in the outliers section, it is shown that the NAs in ORGANIZATION_TYPE are associated with occupations listed as 'Unemployed' or 'Pensioners'.
We will take this to mean that these clients are not associated with any employer and will impute these NAs with 'None'.
ORGANIZATION_TYPE has a missing rate of 18%.
Missing values for AMT_REQ_CREDIT_BUREAU Features
These features indicate the number of credit enquiries on the client in a
given time period: hour, day, week, month, quarter, and year.
There does not appear to be a systematic reason for the missing values in
these features.
We will treat these records as if there are no credit enquiries and
impute with zero.
These features have a missing rate of 13.5%.
Remaining Features with Low Missing Rate
The remaining features with missing values have missing rates below 0.5%.
Imputation on these features should have only a small impact on the total dataset.
We will use standard imputation strategies for these features:
imputation of the mode for categorical features and imputation of the median for continuous features.
We impute continuous features with the median rather than the mean because the continuous features are skewed and
the median is not affected by large values.
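The standard strategies above amount to the following (a sketch on toy data, assuming the dtypes distinguish categorical from continuous columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'income': [100.0, 200.0, np.nan, 10000.0],               # skewed continuous
    'family_status': ['Married', np.nan, 'Married', 'Single'],  # categorical
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Median resists the skew introduced by large values.
        df[col] = df[col].fillna(df[col].median())
    else:
        # Mode (most frequent level) for categorical features.
        df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```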
pd.set_option('display.max_rows', 122)
data = read_clean_data()
missing_values_table(data)
#engineering features from bureau dataset
newFeatures = create_newFeatures(bureau)
newFeatures.head()
newFeatures = newFeatures.fillna(0)
data = data.merge(newFeatures, on = 'SK_ID_CURR', how = 'left')
#data = data.fillna(0)
data.head()
At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:
https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
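Per the NIST definition, extreme outliers fall below Q1 - 3*IQR or above Q3 + 3*IQR. A sketch of flagging them:

```python
import numpy as np

def extreme_outlier_mask(values):
    """Boolean mask of extreme outliers per the NIST fences (3 * IQR)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
    return (values < lower) | (values > upper)

# Toy data: only the last value lies beyond the extreme fences.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
print(extreme_outlier_mask(x))
```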
DAYS_EMPLOYED
There are a large number of entries for DAYS_EMPLOYED outside the main distribution.
These entries are at the value 365243, and there are 55374 instances.
df.DAYS_EMPLOYED.hist(bins = 50);
df.query("DAYS_EMPLOYED >= 100000").DAYS_EMPLOYED.shape
df.query("DAYS_EMPLOYED >= 100000").DAYS_EMPLOYED.head()
These high values for DAYS_EMPLOYED appear to be associated with clients that are "Pensioners" or "Unemployed" and do not list an employment organization (ORGANIZATION_TYPE).
# get the instances with NAME_INCOME_TYPE either Pensioner or Unemployed
pensioner_unemployed = df.query('NAME_INCOME_TYPE == "Pensioner" | NAME_INCOME_TYPE == "Unemployed"')
# filter those instances to NAs for ORGANIZATION_TYPE and get the index
filtered_index = pensioner_unemployed[pensioner_unemployed.ORGANIZATION_TYPE.isna()].index
# get the indexes of the high values
high_val_index = df.query("DAYS_EMPLOYED >= 100000").index
# assert that indexes are the same
if filtered_index.equals(high_val_index):
    print('Indexes of the queried values are the same.')
else:
    print('Indexes of the queried values are different.')
Possible Solution
A possible solution to this data quality issue would be to add a None level to ORGANIZATION_TYPE and only use DAYS_EMPLOYED as an interaction with ORGANIZATION_TYPE.
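One hedged way to implement this: treat the 365243 sentinel in DAYS_EMPLOYED as missing and add a None level to ORGANIZATION_TYPE (a sketch on toy data; cleaning.py may handle it differently):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'DAYS_EMPLOYED': [-2000.0, 365243.0, -500.0],
    'ORGANIZATION_TYPE': ['Business Entity', np.nan, 'School'],
})

# 365243 is a sentinel for clients with no employment; treat it as missing
# so it no longer distorts the DAYS_EMPLOYED distribution.
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].replace(365243.0, np.nan)
# NAs in ORGANIZATION_TYPE correspond to Pensioners/Unemployed; add a None level.
df['ORGANIZATION_TYPE'] = df['ORGANIZATION_TYPE'].fillna('None')
print(df)
```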
All the cleaning steps discussed in the sections above are implemented in cleaning.py.
This script contains a function (read_clean_data) to apply the cleaning steps and return the cleaned dataset for work.
Details
Sample Output
data = read_clean_data()
data.head(2)
During this stage you'll address data mining questions using querying, data visualization, and reporting techniques.
These analyses may directly address your data mining goals. They may also contribute to or refine the data description and quality reports, and feed into the transformation and other data preparation steps needed for further analysis.
We used pandas-profiling to explore the dataset's range, mode, mean, median, variance, counts, etc. After this initial assessment we will focus on the most important attributes, describe their meaning, and add interesting observations we have noted for the dataset. NOTE: Install pandas-profiling into your Conda environment by running "conda install -c conda-forge pandas-profiling"
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
# Generate the pandas_profile report
profile = ProfileReport(data, title='Pandas Profiling Report')
#Run interactive report within Jupyter
profile
profile.to_file(output_file="output.html")
profile.to_notebook_iframe()